ggplot2
Use the :: function to reference functions
inside packages.
select is a very common function name. Usually we need
to use dplyr::select.
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 5 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
## 6 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68
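As a small illustration of the :: note above, the sketch below calls select through its package name. It assumes dplyr and ggplot2 are installed; the object name small_diamonds is made up for this example.

```r
library(dplyr)    # provides select()
library(ggplot2)  # provides the diamonds data used in these notes

# Explicit namespace: this calls dplyr's select(),
# even if another attached package also defines a select().
# small_diamonds is a hypothetical name used only for this sketch.
small_diamonds <- dplyr::select(ggplot2::diamonds, carat, cut, price)

head(small_diamonds)
```

Writing the package name is optional when only one attached package defines the function, but it removes any doubt about which select is being called.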
Every plot starts with a ggplot() statement.
Glyph layers are added with geom_[style]() functions such as geom_point(), geom_bar(), geom_boxplot(), geom_density(), geom_vline(), geom_segment(), and geom_histogram().
Variables are mapped to aesthetics with aes( ). Rule of thumb: ALL variables need to be inside an aes (except facets, later in slides), and all constants go outside of the aes.
geom_point(aes(color = gender)) vs. geom_point(color = "red")
Layers are joined with the + symbol.
Do not use %>% between layers of ggplot2 graphics. + is the equivalent of “add layer on top of …” in ggplot2 portions, whereas %>% is “and then the next step is…”
Steps 4 and 5 can be switched.
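A minimal sketch of the variable vs. constant rule and of + vs. %>%. It uses the mpg data that ships with ggplot2 rather than the course data, and the object names (p_variable, p_constant, p_piped) are invented for this example.

```r
library(dplyr)
library(ggplot2)

# Variable: drv is a column of mpg, so it goes INSIDE aes()
p_variable <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv))

# Constant: "red" is not a column, so it goes OUTSIDE aes()
p_constant <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

# %>% chains wrangling steps ("and then ..."),
# + stacks plot layers ("add on top of ...")
p_piped <- mpg %>%
  filter(year == 2008) %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv))
```

Printing any of the three objects draws the plot. Note that the switch from %>% to + happens exactly at the ggplot() call.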
Let’s look at our BabyNames data set again.
## Rows: 1,792,091
## Columns: 4
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida"…
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ count <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258…
## $ year <int> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880…
names <- c("Olivia", "Zoe", "Quentin")
Names <-
  BabyNames %>%
  filter(name %in% names) %>%
  group_by(name, year) %>%
  summarise(total = sum(count, na.rm = TRUE))
Names %>%
  head()
## # A tibble: 6 × 3
## # Groups: name [1]
## name year total
## <chr> <int> <int>
## 1 Olivia 1880 44
## 2 Olivia 1881 51
## 3 Olivia 1882 52
## 4 Olivia 1883 46
## 5 Olivia 1884 54
## 6 Olivia 1885 59
The graph looks perfectly fine, but this code isn’t easy to read.
This is why we stress writing readable code!
ggplot(data = Names, aes(x = year, y = total)) + geom_line() + aes(colour = name) + theme(legend.position = "right") + labs(title = "")
Nothing is here! That is exactly what is supposed to happen. Calling
ggplot() only tells R that we are ready to plot and that we
want to create some space to make the plot.
Still nothing! We need to tell it what our axes are.
Note that ggplot uses +, NOT %>%. This
is because we are adding layers to our plots.
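The build-up described above can be sketched in three steps. The toy_names data frame is a made-up stand-in for the Names table (the counts are invented for illustration only), so the block runs without BabyNames.

```r
library(ggplot2)

# Hypothetical stand-in for Names; the values are made up
toy_names <- data.frame(
  name  = rep(c("A", "B"), each = 3),
  year  = rep(2000:2002, times = 2),
  total = c(10, 12, 15, 30, 28, 25)
)

p_empty <- ggplot()                                            # blank canvas only
p_axes  <- ggplot(data = toy_names, aes(x = year, y = total))  # axes, still no glyphs
p_lines <- p_axes + geom_line(aes(color = name))               # first layer added with +

# Number of layers in each plot object
length(p_empty$layers)
length(p_axes$layers)
length(p_lines$layers)
```

Only the third object has a layer, which is why the first two draw nothing but an empty panel.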
## Error in `geom_line()`:
## ! Problem while setting up geom.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_geom_1()`:
## ! `geom_line()` requires the following missing aesthetics: x and y
Note - this is why I like to map aesthetics first, so we can avoid errors.
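A hedged sketch of how this error arises and how mapping aesthetics first avoids it. The toy_names data frame is again a made-up stand-in for Names, and ggplot_build() is used here only because it performs the same build step that printing a plot triggers.

```r
library(ggplot2)

# Hypothetical stand-in for Names; the values are made up
toy_names <- data.frame(
  name  = rep(c("A", "B"), each = 3),
  year  = rep(2000:2002, times = 2),
  total = c(10, 12, 15, 30, 28, 25)
)

# No aes() anywhere: geom_line() has no x or y to work with
p_missing <- ggplot(data = toy_names) + geom_line()

# The error only appears when the plot is built (i.e., when it is printed)
result_missing <- tryCatch(
  { ggplot_build(p_missing); "built" },
  error = function(e) "error"
)

# Mapping x and y first avoids the problem
p_ok <- ggplot(data = toy_names, aes(x = year, y = total)) +
  geom_line(aes(color = name))
result_ok <- tryCatch(
  { ggplot_build(p_ok); "built" },
  error = function(e) "error"
)

c(result_missing, result_ok)
```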
Rule of thumb: any time you are plotting with ggplot, ALL
variables need to be inside an aes (except facets, later in
slides), and all constants go outside of the aes.
# add color
# note that color includes the groups argument but not vice versa!
ggplot(data = Names) +
  geom_line(aes(x = year, y = total, color = name))

ggplot(data = Names) +
  geom_line(aes(x = year, y = total, color = name)) +
  ggtitle("Names Over Time") +
  xlab("Year") +
  ylab("Popularity") +
  guides(color = guide_legend(title = "Siblings Names"))

ggplot(data = Names) +
  geom_line(aes(x = year, y = total, color = name, linetype = name)) +
  ggtitle("Names Over Time") +
  xlim(c(1972, 2022)) +
  xlab("Year") +
  ylab("Popularity") +
  guides(color = guide_legend(title = "Siblings Names"),
         linetype = guide_legend(title = "Still Siblings Names"))
## Warning: Removed 252 rows containing missing values (`geom_line()`).
facet_wrap()
The syntax for facets uses a formula syntax we haven’t seen much yet. There are two main ways to plot with facets. Here are a few pointers:
facet_wrap() just makes a separate plot for each level
of the categorical variable
facet_wrap( ~ categoricalVariable)
data("NCHS")
# `!is.na(smoker)` finds cases that are non-missing for `smoker` (i.e. removes NA's)
Heights <-
  NCHS %>%
  filter(age > 20, !is.na(smoker)) %>%
  group_by(sex, smoker, age) %>%
  summarise(height = mean(height, na.rm = TRUE))
head(Heights)
## # A tibble: 6 × 4
## # Groups: sex, smoker [1]
## sex smoker age height
## <fct> <fct> <dbl> <dbl>
## 1 female no 21 1.60
## 2 female no 22 1.62
## 3 female no 23 1.61
## 4 female no 24 1.62
## 5 female no 25 1.63
## 6 female no 26 1.62
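The formula syntax for facets can be tried on any data set. The sketch below uses mpg from ggplot2 instead of Heights, and shows facet_wrap() next to facet_grid(), which arranges panels in rows and/or columns. All object names are invented for this example.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

p_wrap <- base_plot + facet_wrap( ~ drv)      # one panel per level of drv
p_rows <- base_plot + facet_grid(drv ~ .)     # drv levels as rows (the "." is a placeholder)
p_cols <- base_plot + facet_grid( ~ cyl)      # cyl levels as columns
p_both <- base_plot + facet_grid(drv ~ cyl)   # rows by drv, columns by cyl
```

The variable to the left of ~ defines rows and the variable to the right defines columns; facet_wrap() takes only the right-hand side.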
Heights %>%
  ggplot(aes(x = age, y = height)) +
  geom_line(aes(linetype = smoker)) +
  facet_wrap( ~ sex)
facet_grid() allows control of row & column facets
facet_grid() syntax:
rows and columns: facet_grid(rows ~ cols)
rows only: facet_grid( rows ~ . ) (note the required “.”)
columns only: facet_grid( ~ cols) (no “.” this time)
Heights %>%
  ggplot(aes(x = age, y = height)) +
  geom_line(aes(linetype = smoker)) +
  facet_grid(sex ~ .)
color and fill
## wage educ race sex hispanic south married exper union age sector
## 1 9.0 10 W M NH NS Married 27 Not 43 const
## 2 5.5 12 W M NH NS Married 20 Not 38 sales
## 3 3.8 12 W F NH NS Single 4 Not 22 sales
## 4 10.5 12 W F NH NS Married 29 Not 47 clerical
## 5 15.0 12 W M NH NS Married 40 Union 58 const
## 6 9.0 16 W F NH NS Married 27 Not 49 clerical
CPS85 %>%
  ggplot() +
  geom_density(aes(x = wage, color = sex), alpha = 0.4) +
  facet_grid( ~ married) +
  xlim(0, 30)
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

CPS85 %>%
  ggplot() +
  geom_density(aes(x = wage, fill = sex), alpha = 0.4) +
  facet_grid( ~ married) +
  xlim(0, 30)
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

CPS85 %>%
  ggplot() +
  geom_density(aes(x = wage, fill = sex, color = sex), alpha = 0.4) +
  facet_grid( ~ married) +
  xlim(0, 30)
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
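The difference between color and fill can also be seen on bars. This is a sketch on the mpg data from ggplot2, not on CPS85, and the object names are invented for this example.

```r
library(ggplot2)

# color = drv: only the bar OUTLINES vary by drv
p_color <- ggplot(mpg, aes(x = class, color = drv)) +
  geom_bar()

# fill = drv: the bar INTERIORS vary by drv
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# both: outline and interior vary together; alpha is a constant, so it sits outside aes()
p_both <- ggplot(mpg, aes(x = class, fill = drv, color = drv)) +
  geom_bar(alpha = 0.4)
```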
CPS85 %>%
  ggplot(aes(x = married, color = sex)) +
  geom_bar() +
  facet_wrap( ~ union, scales = "free") # Note the scales here

CPS85 %>%
  ggplot(aes(x = married, fill = sex)) +
  geom_bar() +
  facet_wrap( ~ union, scales = "free") # Note the scales here
establish the frame
plot the glyphs (i.e., select a geom)
map the aesthetics
add labels and title
other features (e.g., alpha, sizing, etc)
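One possible way to line the five steps up with code, sketched on the diamonds data. The step numbers in the comments refer to the list above; the name p_steps and the choice of cut for color are assumptions made for this example.

```r
library(ggplot2)

p_steps <-
  ggplot(data = diamonds) +                    # 1. establish the frame
  geom_point(                                  # 2. plot the glyphs
    aes(x = carat, y = price, color = cut),    # 3. map the aesthetics
    alpha = 0.3, size = 0.5                    # 5. other features (constants, outside aes)
  ) +
  labs(title = "Diamonds Data",                # 4. add labels and title
       x = "Carat", y = "Price")
```

The label layer and the constant settings could be written in either order without changing the plot.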
Establish the Frame
## Error in `geom_point()`:
## ! Problem while setting up geom.
## ℹ Error occurred in the 1st layer.
## Caused by error in `compute_geom_1()`:
## ! `geom_point()` requires the following missing aesthetics: x and y
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = depth), alpha = 0.5, size = 1) +
  ggtitle("Diamonds Data") +
  xlab("Carat") +
  ylab("Price")
Notice that I can have aes inside multiple statements.
Notice that when I use constants (like
alpha = 0.3, size = 0.1) they ARE NOT inside
aes.
In general, variables go inside aes and constants go
outside of it (unless we are using facets; see the previous
materials).
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(colour = depth), alpha = 0.3, size = 0.1) +
  ggtitle("Diamonds Data") +
  xlab("Carat") +
  ylab("Price") +
  facet_grid(cut ~ color)

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(colour = "red", alpha = 0.3, size = 0.1) +
  ggtitle("Diamonds Data") +
  xlab("Carat") +
  ylab("Price") +
  facet_grid(cut ~ color)
aes
aes can either go inside the ggplot()
function, or inside the geom_[chart]() function itself, or
both. The 3 following options create the same plot, but the code is
slightly different.
# Option 1
ggplot(data = diamonds) +
  geom_point(aes(x = carat, y = price, color = clarity),
             alpha = 0.2,
             size = 1) +
  geom_smooth(method = "glm",
              formula = y ~ poly(x, 2), # y = b_0 + b_1 x + b_2 x^2 + e
              aes(x = carat, y = price),
              color = "red") +
  ylim(c(0, 20000))

# Option 2
ggplot(data = diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.2,
             size = 1) +
  geom_smooth(method = "glm",
              formula = y ~ poly(x, 2), # y = b_0 + b_1 x + b_2 x^2 + e
              aes(x = carat, y = price),
              color = "red") +
  ylim(c(0, 20000))

# Option 3
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = clarity),
             alpha = 0.2,
             size = 1) +
  geom_smooth(method = "glm",
              formula = y ~ poly(x, 2), # y = b_0 + b_1 x + b_2 x^2 + e
              color = "red") +
  ylim(c(0, 20000))
I personally prefer to put “global” aesthetics in the
ggplot() and “local” aesthetics in the
geom.
Option 1 maps x and y twice, once in each geom
In Option 2, color = clarity is not needed for geom_smooth
geom_point and geom_smooth both use x and y so I put them in the ggplot()
geom_point uses color = clarity so I put that ONLY in the geom_point function
In my opinion, Option 3 is the “cleanest” code. This is partly based on stylistic preference and partly based on some internal mechanics of ggplot2 (beyond the scope of this course). How you write your code is up to you. Just keep it readable!
But again, all 3 versions generate the exact same plot (so does it really matter that much which option we use?)
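To see where each aesthetic ends up, the plot object can be inspected without drawing it. This is a sketch based on the “global in ggplot(), local in the geom” layout described above; the name p_global_local is invented for this example.

```r
library(ggplot2)

p_global_local <-
  ggplot(data = diamonds, aes(x = carat, y = price)) +        # global: x and y
  geom_point(aes(color = clarity), alpha = 0.2, size = 1) +   # local: color
  geom_smooth(method = "glm",
              formula = y ~ poly(x, 2),
              color = "red") +                                # constant, outside aes
  ylim(c(0, 20000))

names(p_global_local$mapping)                # aesthetics shared by every layer
names(p_global_local$layers[[1]]$mapping)    # aesthetics local to geom_point
names(p_global_local$layers[[2]]$mapping)    # geom_smooth adds none of its own
```

Each layer uses the global mapping plus whatever it declares locally, which is why x and y only need to be written once here.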